Web content delivery & protection
2013: WAF development by request of Positive Technologies
"Visionary" in the Gartner Magic Quadrant 2015


* Web attacks
* L7 HTTP/HTTPS DDoS attacks


> Nginx, HAProxy, etc. are perfect HTTP proxies, not HTTP filters

> Netfilter works in the TCP/IP stack
=> HTTP(S)/TCP/IP stack


> Tempesta FW:
* hybrid of HTTP accelerator & firewall
* embedded into the Linux TCP/IP stack


Tempesta TLS


https://netdevconf.info/0x12/session.html?kernel-tls-handhakes-for-https-ddos-mitigation


> Part of Tempesta FW, an open source Application Delivery Controller


Web cache 

Load balancer 

DDoS protection 

Web application security 
Application performance monitoring 


> Open source alternative to F5 BIG-IP or Fortinet ADC

SSL/TLS offloading

Data compression

Connections QoS


> "TLS CPS/TPS": common specification for network security appliances & ADCs


Linux kernel TLS handshakes


> Very fast light-weight Linux kernel implementation
* ...even for session resumption
* there is modern research in the field
> Resistant against DDoS on TLS handshakes (asymmetric DDoS)
> Privileged address space for sensitive security data
* Varnish: TLS is processed in a separate process, Hitch
http://varnish-cache.org/docs/trunk/phk/ssl.html
* Resistance against attacks like CloudBleed
https://blog.cloudflare.com/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/


Tempesta 


Why NIST p256? 


> ECDSA
* RSA and the NIST curves p256, p384, and p521 are the only ones allowed for CA certificates
https://cabforum.org/baseline-requirements-documents/
* P256 is the fastest NIST curve
* P521 isn't recommended by IANA
https://www.iana.org/assignments/tls-parameters/tls-parameters.xhtml#tls-parameters-8
* RSA is slow and vulnerable to asymmetric DDoS
https://vincent.bernat.im/en/blog/2011-ssl-dos-mitigation


> Curve25519
* Much faster than NIST p256
* In practice for ECDHE only


TLS libraries performance issues


Copies, memory initializations/erasing, memory comparisons:
memcpy(), memset(), memcmp() and their constant-time analogs


Many dynamic allocations 
Large data structures 
Some math is outdated 
[Figure: perf profile; symbols recovered from the figure]
[kernel.kallsyms], _int_malloc, __ecp_nistz256_mul_montx, __ecp_nistz256_sqr_montx,
sha256_block_data_order_avx2, ecp_nistz256_avx2_gather_w7, _int_free, malloc_consolidate,
OPENSSL_cleanse, malloc, do_syscall_64, free, ecp_nistz256_ord_sqr_montx,
ecp_nistz256_point_doublex, __ecp_nistz256_sub_fromx, __memmove_avx_unaligned_erms,
__ecp_nistz256_mul_by_2x, __memset_avx2_unaligned_erms, aesni_ecb_encrypt,
ecp_nistz256_point_addx, EVP_MD_CTX_reset, entry_SYSCALL_64


The source code 
https://github.com/tempesta-tech/tempesta/tree/master/tls 


Still in progress: we implement some of the algorithms on our own
> Initially a fork of mbed TLS 2.8.0 (https://tls.mbed.org/) - x40 faster!
* very portable and easy to move into the kernel
* cutting edge security
* too many memory allocations (https://github.com/tempesta-tech/tempesta/issues/614)
* big integer abstractions (https://github.com/tempesta-tech/tempesta/issues/1064)
* inefficient algorithms, no architecture-specific implementations, ...

> We also take parts from WolfSSL (https://github.com/wolfSSL/wolfssl/)
* very fast, but not portable
* security https://github.com/wolfSSL/wolfssl/issues/3184


ECDSA & ECDHE mathematics: 
Tempesta TLS, OpenSSL, WolfSSL 


> OpenSSL 1.1.1h
256 bits ecdsa (nistp256)   36473 sign/s
256 bits ecdh (nistp256)    16620 op/s


> WolfSSL (current master)
ECDSA 256 sign    43260 ops/sec (+19%)
ECDHE 256 agree   40878 ops/sec (+146%)


> Tempesta TLS (full TLS handshake operation)
ECDSA sign (nistp256): ops/s=38393
ECDHE srv (nistp256):  ops/s=13418


> OpenSSL & WolfSSL don't include ephemeral keys generation
(one more m * G operation)


Demo! 


> Tempesta TLS, Nginx-1.14.2/OpenSSL-1.1.1d, Nginx-1.17.8/WolfSSL
> TLS 1.2
* full handshakes
* abbreviated handshakes
> tls-perf
https://github.com/tempesta-tech/tls-perf
* establish & drop many TLS connections in parallel
* like TLS-THC-DOS, but faster, more flexible, more options


Data for proprietary vendors 


> BIG-IP is only 30-50% faster than Nginx/OpenSSL/DPDK
https://www.youtube.com/watch?v=PIv87h8GtLc


> Avi Vantage (VMware) makes ~2000 handshakes/second per 1 CPU
https://avinetworks.com/docs/latest/ssl-performance/


Why faster?


No memory allocations in run time 
No context switches 

No copies on socket I/O 

Fewer message queues


Zero-copy handshakes state machine
https://netdevconf.info/0x12/session.html?kernel-tls-handhakes-for-https-ddos-mitigation


State of the art cryptography mathematics


Elliptic curve cryptography


> secp256r1: y^2 = x^3 - 3x + b defined over the field GF(p),
p256 = 2^256 - 2^224 + 2^192 + 2^96 - 1


> The group law
* negatives: if P = (x, y), then -P = (x, -y)
* addition: R = P + Q
* doubling: R = P + P = 2*P


> ECDSA: k is a secure random number, G is a known point
* k * G is used for the signature


> ECDH: d is the private key, Q is the public key
* shared secret: d * Q


Point multiplication 


OpenSSL: "Fast prime field elliptic-curve cryptography with 256-bit primes" by Gueron and Krasnov
> Q = m * P is the most expensive elliptic curve operation

    Q <- O                      (the point at infinity)
    for i in bits(m):           (most significant bit first)
        Q <- point_double(Q)
        if m_i == 1:
            Q <- point_add(Q, P)


> Point multiplications in a TLS handshake:
* known point multiplication: precompute the table for doubled G
* perfect forward secrecy ECDHE: generate keys G * d (d is random)
* handshake: 2 known & 1 unknown point multiplications
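The double-and-add loop above can be tried on a toy curve. The following is only a sketch under stated assumptions: the curve y^2 = x^3 - 3x + 3 over GF(23), the base point (1, 1), and all function names are chosen for illustration and are not taken from Tempesta TLS, OpenSSL, or any other library. It uses affine coordinates with one inversion per addition and is not constant time, so it shows the structure of the algorithm only.

```c
#include <assert.h>
#include <stdint.h>

/* Toy parameters (assumption for illustration): y^2 = x^3 - 3x + 3 over GF(23). */
enum { P_MOD = 23, CURVE_A = 20 /* -3 mod 23 */, CURVE_B = 3 };

typedef struct { uint32_t x, y; int inf; } ec_point; /* inf != 0: point at infinity */

static uint32_t fmul(uint32_t a, uint32_t b) { return a * b % P_MOD; }
static uint32_t fsub(uint32_t a, uint32_t b) { return (a + P_MOD - b) % P_MOD; }

/* Modular inverse as a^(p-2) mod p (Fermat), square-and-multiply. */
static uint32_t finv(uint32_t a)
{
    uint32_t r = 1;
    for (uint32_t e = P_MOD - 2; e; e >>= 1, a = fmul(a, a))
        if (e & 1)
            r = fmul(r, a);
    return r;
}

/* Affine group law; handles doubling and the point at infinity. */
ec_point ec_add(ec_point p, ec_point q)
{
    if (p.inf) return q;
    if (q.inf) return p;
    uint32_t num, den;
    if (p.x == q.x) {
        if ((p.y + q.y) % P_MOD == 0)          /* P + (-P) = infinity */
            return (ec_point){0, 0, 1};
        num = (3 * fmul(p.x, p.x) + CURVE_A) % P_MOD; /* tangent slope */
        den = fmul(2, p.y);
    } else {
        num = fsub(q.y, p.y);                  /* chord slope */
        den = fsub(q.x, p.x);
    }
    uint32_t l = fmul(num, finv(den));
    uint32_t x3 = fsub(fsub(fmul(l, l), p.x), q.x);
    uint32_t y3 = fsub(fmul(l, fsub(p.x, x3)), p.y);
    return (ec_point){x3, y3, 0};
}

/* Q = m * P: scan the bits of m from the most significant one. */
ec_point ec_mul(uint32_t m, ec_point p)
{
    ec_point q = {0, 0, 1};
    for (int i = 31; i >= 0; i--) {
        q = ec_add(q, q);                      /* point double */
        if ((m >> i) & 1)
            q = ec_add(q, p);                  /* point add */
    }
    return q;
}
```

For example, `ec_mul(2, (ec_point){1, 1, 0})` doubles the base point. A production implementation would replace the bit-by-bit loop with a windowed or comb method and a precomputed table, as the bullets above indicate.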




Point representation and coordinate systems 


http://www.hyperelliptic.org/EFD/g1p/auto-shortw-jacobian.html


"Analysis and optimization of elliptic-curve single-scalar multiplication", Bernstein & Lange, 2007


> Jacobian coordinates (rough estimations)
* conversion overhead: 39*M + 4*S + 3*I (for w(indow) = 4)
* point addition (mixed): 8*M + 3*S, doubling: 2*M + 4*S
> Affine coordinates (rough estimations)
* point addition: 13*M + 4*S, doubling: 4*M + 5*S
> NIST 256 bits, 256 / w = 64 comb rounds (addition & doubling):
64 * (10*M + 7*S) << 64 * (17*M + 9*S)


Point addition 


http://www.hyperelliptic.org/EFD/g1p/auto-shortw-jacobian.html#addition-add-2007-bl


> ex. addition in Jacobian coordinates (cost: 11M + 5S)
A = (X1, Y1, Z1), B = (X2, Y2, Z2), then C = A + B = (X3, Y3, Z3) is

    U1 = X1 * Z2^2
    U2 = X2 * Z1^2
    S1 = Y1 * Z2^3
    S2 = Y2 * Z1^3
    H  = U2 - U1
    R  = S2 - S1
    Z3 = H * Z1 * Z2
    X3 = R^2 - H^3 - 2 * U1 * H^2
    Y3 = (U1 * H^2 - X3) * R - S1 * H^3
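The Jacobian addition formulas above can be checked numerically on a toy curve. This is a sketch under assumptions: the curve y^2 = x^3 - 3x + 3 over GF(23) and every name below are picked for illustration only. The code evaluates the formulas in the plain textbook order, which takes 12M + 4S; the 11M + 5S count quoted above comes from the rearranged add-2007-bl variant in the EFD. Only the generic case (H != 0) is handled.

```c
#include <assert.h>
#include <stdint.h>

enum { P_MOD = 23 };  /* toy prime field, an assumption for illustration */

typedef struct { uint32_t x, y, z; } jac_point; /* affine (x/z^2, y/z^3) */

static uint32_t fmul(uint32_t a, uint32_t b) { return a * b % P_MOD; }
static uint32_t fsub(uint32_t a, uint32_t b) { return (a + P_MOD - b) % P_MOD; }

/* Modular inverse as a^(p-2) mod p. */
static uint32_t finv(uint32_t a)
{
    uint32_t r = 1;
    for (uint32_t e = P_MOD - 2; e; e >>= 1, a = fmul(a, a))
        if (e & 1)
            r = fmul(r, a);
    return r;
}

/* C = A + B for distinct points with different x (H != 0), no inversion. */
jac_point jac_add(jac_point a, jac_point b)
{
    uint32_t z1z1 = fmul(a.z, a.z), z2z2 = fmul(b.z, b.z);
    uint32_t u1 = fmul(a.x, z2z2);              /* U1 = X1 * Z2^2 */
    uint32_t u2 = fmul(b.x, z1z1);              /* U2 = X2 * Z1^2 */
    uint32_t s1 = fmul(a.y, fmul(b.z, z2z2));   /* S1 = Y1 * Z2^3 */
    uint32_t s2 = fmul(b.y, fmul(a.z, z1z1));   /* S2 = Y2 * Z1^3 */
    uint32_t h = fsub(u2, u1), r = fsub(s2, s1);
    assert(h != 0);                             /* generic case only */
    uint32_t hh = fmul(h, h), hhh = fmul(h, hh), v = fmul(u1, hh);
    jac_point c;
    c.x = fsub(fsub(fmul(r, r), hhh), fmul(2, v));     /* R^2 - H^3 - 2*U1*H^2 */
    c.y = fsub(fmul(r, fsub(v, c.x)), fmul(s1, hhh));  /* (U1*H^2 - X3)*R - S1*H^3 */
    c.z = fmul(h, fmul(a.z, b.z));                     /* H * Z1 * Z2 */
    return c;
}

/* The single inversion is paid only when converting back to affine. */
void jac_to_affine(jac_point p, uint32_t *x, uint32_t *y)
{
    uint32_t zi = finv(p.z), zi2 = fmul(zi, zi);
    *x = fmul(p.x, zi2);
    *y = fmul(p.y, fmul(zi, zi2));
}
```

On this toy curve, adding P = (1, 1) and 2P = (21, 22), the latter written in Jacobian form with Z = 2 as (15, 15, 2), yields the affine point (4, 20), the same result as the affine chord formula.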
The cost


> Modular multiplication (M) is the most expensive basic scalar operation


> Modular squaring (S) is faster than M, usually 0.8M (Montgomery)
(0.9M for optimized FIPS due to more expensive modular reduction)


> Modular inversion (I) is very expensive, about 100M


Modular arithmetic


> ex. prime field F(29)
* addition: 17 + 20 = 8 since 37 mod 29 = 8
* subtraction: 17 - 20 = 26 since -3 mod 29 = 26
* multiplication: 17 * 20 = 21 since 340 mod 29 = 21
* inversion: 17^-1 = 12 since 17 * 12 mod 29 = 1


> Montgomery reduction (the most used)
* there is some overhead, but each modular operation is cheaper


> FIPS reduction
* Can be faster if a small number of modular operations is used
* There are optimization techniques,
e.g. "Low-Latency Elliptic Curve Scalar Multiplication", Bos, 2012
* But still about 65% slower than Montgomery reduction
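The F(29) examples on this slide can be reproduced with a few lines of C. This is a minimal sketch: the function names are hypothetical, the modulus is the toy prime 29 from the examples, and inversion uses Fermat's little theorem (a^(p-2) mod p), which is one of several possible inversion methods.

```c
#include <assert.h>
#include <stdint.h>

#define FP_P 29u  /* toy prime modulus from the examples */

/* (a + b) mod p for operands already reduced into [0, p). */
uint32_t fp_add(uint32_t a, uint32_t b)
{
    assert(a < FP_P && b < FP_P);
    uint32_t r = a + b;
    return r >= FP_P ? r - FP_P : r;
}

/* (a - b) mod p, always returning a value in [0, p). */
uint32_t fp_sub(uint32_t a, uint32_t b)
{
    assert(a < FP_P && b < FP_P);
    return a >= b ? a - b : a + FP_P - b;
}

/* (a * b) mod p with a plain division-based reduction. */
uint32_t fp_mul(uint32_t a, uint32_t b)
{
    assert(a < FP_P && b < FP_P);
    return (a * b) % FP_P;
}

/* a^-1 mod p computed as a^(p-2) mod p by square-and-multiply. */
uint32_t fp_inv(uint32_t a)
{
    assert(a != 0 && a < FP_P);
    uint32_t r = 1, base = a, e = FP_P - 2;
    while (e) {
        if (e & 1)
            r = fp_mul(r, base);
        base = fp_mul(base, base);
        e >>= 1;
    }
    return r;
}
```

With these helpers, `fp_add(17, 20)`, `fp_sub(17, 20)`, `fp_mul(17, 20)` and `fp_inv(17)` give 8, 26, 21 and 12, matching the four examples. The `%` in `fp_mul` is exactly the step that Montgomery or FIPS reduction replaces for large moduli.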


Montgomery multiplication in P256 


“Montgomery Multiplication”, Henry S. Warren, Jr. 

> Fast 256-bit integer multiplication with modular reduction on P256:
a, b < m (m is the P256 modulus)

> Set n = 2^256


> Transform multipliers to the Montgomery domain (overhead):
a' = a * n mod m, b' = b * n mod m


> Fast multiplication with reduction: u = a'b'/n mod m
* compute only 256 bits of (a'b' + (-m^-1 * a'b' mod n) * m) / n
* if u >= m, then u = u - m (unconditionally, carry as a mask)
> Convert to an ordinary number: v = u * n^-1 mod m
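The steps above can be sketched at a scaled-down size. The code below is an illustration under assumptions, not the Tempesta TLS implementation: it uses n = 2^32 instead of 2^256, requires an odd modulus m < 2^31 so that intermediate sums fit in 64 bits, and all function names are hypothetical. The final subtraction is applied through a mask rather than a branch, in the spirit of the "carry as a mask" remark.

```c
#include <assert.h>
#include <stdint.h>

/* -m^-1 mod 2^32 for odd m, by Newton iteration on the inverse. */
uint32_t mont_neg_inv(uint32_t m)
{
    assert(m & 1);
    uint32_t inv = m;              /* m * m == 1 (mod 8): 3 correct bits */
    for (int i = 0; i < 4; i++)    /* 3 -> 6 -> 12 -> 24 -> 48 bits */
        inv *= 2u - m * inv;
    return 0u - inv;
}

/* REDC: returns t / n mod m for n = 2^32, t < m * n, m < 2^31. */
uint32_t mont_redc(uint64_t t, uint32_t m, uint32_t m_neg_inv)
{
    uint32_t q = (uint32_t)t * m_neg_inv;       /* (-m^-1 * t) mod n */
    uint64_t u = (t + (uint64_t)q * m) >> 32;   /* exact division by n, u < 2m */
    uint64_t mask = 0u - (uint64_t)(u >= m);    /* all ones iff u >= m */
    return (uint32_t)(u - (m & mask));          /* masked final subtraction */
}

/* a' = a * n mod m (the conversion overhead). */
uint32_t mont_to(uint32_t a, uint32_t m)
{
    return (uint32_t)(((uint64_t)a << 32) % m);
}

/* u = a' * b' / n mod m, still in the Montgomery domain. */
uint32_t mont_mul(uint32_t a1, uint32_t b1, uint32_t m, uint32_t m_neg_inv)
{
    return mont_redc((uint64_t)a1 * b1, m, m_neg_inv);
}

/* v = u / n mod m, back to an ordinary number. */
uint32_t mont_from(uint32_t u, uint32_t m, uint32_t m_neg_inv)
{
    return mont_redc(u, m, m_neg_inv);
}

/* Convenience wrapper: a * b mod m through the Montgomery domain. */
uint32_t mulmod_via_mont(uint32_t a, uint32_t b, uint32_t m)
{
    assert((m & 1) && m < (1u << 31) && a < m && b < m);
    uint32_t k = mont_neg_inv(m);
    uint32_t u = mont_mul(mont_to(a, m), mont_to(b, m), m, k);
    return mont_from(u, m, k);
}
```

The wrapper converts in and out for a single product, which is the worst case for Montgomery arithmetic. The method pays off when many multiplications are chained in the Montgomery domain and the conversions are done once, as in a point multiplication.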
The math layers


> Different multiplication algorithms for fixed and unknown point
* "Efficient Fixed Base Exponentiation and Scalar Multiplication based on a Multiplicative Splitting Exponent Recoding", Robert et al., 2019


> Point doubling and addition: everyone seems to use the same algorithms
> Jacobian coordinates: different modular inversion algorithms
* "Fast constant-time gcd computation and modular inversion", Bernstein et al., 2019
> Modular reduction for scalar multiplications:
* Montgomery has overhead vs FIPS speed: if we use fewer multiplications, it could make sense to use a different reduction method, FIPS (seems a dead end)
* "Low-Latency Elliptic Curve Scalar Multiplication", Bos, 2012
=> Balance between all the layers


Example of math layers balancing 


> For w=5 we need 52 point additions for an unknown point multiplication
> Jacobian coordinates addition takes 11M + 5S
> Affine-Jacobian coordinates addition takes 8M + 3S

* about 4.6M cheaper (3M + 2S) if S = 0.8M


* requires 2 coordinates normalizations for comb precomputation
* coordinates normalization: 2^(w-1) * (6M + 1S) + 1I


> Almost the same for S = 0.9M and fast inversion I < 100M
> Montgomery arithmetic (S = 0.8M):
* ECDHE +28% and ECDSA +6% performance
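The per-addition saving quoted on this slide is simple arithmetic that can be scripted to try other cost ratios. The sketch below covers only the addition costs (11M + 5S against 8M + 3S over 52 additions); the normalization overhead is deliberately left out. Costs are kept in tenths of M to avoid floating point, and the function names are hypothetical.

```c
#include <assert.h>

/* All costs are in tenths of one modular multiplication M. */
enum { COST_M = 10 };

/* Cost of an operation made of n_mul multiplications and n_sqr squarings. */
int op_cost(int n_mul, int n_sqr, int sqr_cost)
{
    assert(n_mul >= 0 && n_sqr >= 0 && sqr_cost > 0);
    return n_mul * COST_M + n_sqr * sqr_cost;
}

/* Saving of mixed (affine-Jacobian) over full Jacobian addition
 * accumulated over n_adds point additions. */
int addition_saving(int n_adds, int sqr_cost)
{
    int jacobian = op_cost(11, 5, sqr_cost);  /* 11M + 5S */
    int mixed = op_cost(8, 3, sqr_cost);      /*  8M + 3S */
    return n_adds * (jacobian - mixed);
}
```

With S = 0.8M (`sqr_cost = 8`) one addition saves 3M + 2S = 4.6M, and 52 additions save 239.2M. With S = 0.9M the figures become 4.8M and 249.6M. Whether this outweighs the extra normalizations depends on the cost of inversion.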
Side channel attacks (SCA) resistance


> Timing attacks, simple power analysis, differential power analysis etc.


> Protections against SCAs:
* Constant time algorithms
* Dummy operations
* Point randomization
* e.g. modular inversion: 741 vs 266 iterations


> RDRAND allows writing faster non-constant-time algorithms
* SRBDS mitigation costs about 97% performance
https://software.intel.com/[...]r-data-sampling


Memory usage & SCA 


> ex. ECDSA precomputed table for fixed point multiplication
* mbed TLS: ~8KB dynamically precomputed table, point randomization, constant-time algorithm, full table scan
* OpenSSL: ~150KB static table, full scan
* WolfSSL: ~150KB, direct index access (fixed in the new version)
https://github.com/wolfSSL/wolfssl/issues/3184


=> 150KB is far larger than the L1d cache size, so there are many cache misses
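The difference between a full table scan and direct index access can be shown in a few lines. This is a hedged sketch: the point layout `affine_pt`, the limb count, and the function name are assumptions for illustration and do not mirror the table format of mbed TLS, OpenSSL, or WolfSSL. A direct read `table[idx]` touches cache lines that depend on the secret index, while the scan reads every entry and keeps only the wanted one through a mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LIMBS 4  /* 4 x 64 bits = one 256-bit coordinate */

/* Hypothetical layout of a precomputed affine point. */
typedef struct { uint64_t x[LIMBS], y[LIMBS]; } affine_pt;

/* Copy table[idx] into *out while reading all n entries, so the memory
 * access pattern does not depend on idx. The assert is a debug-time
 * precondition only and is not part of the constant-time path. */
void ct_table_select(affine_pt *out, const affine_pt *table, size_t n, size_t idx)
{
    assert(idx < n);
    for (size_t k = 0; k < LIMBS; k++)
        out->x[k] = out->y[k] = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t diff = (uint64_t)i ^ (uint64_t)idx;
        /* all ones iff i == idx, computed without a branch */
        uint64_t mask = ((diff | (0 - diff)) >> 63) - 1;
        for (size_t k = 0; k < LIMBS; k++) {
            out->x[k] |= table[i].x[k] & mask;
            out->y[k] |= table[i].y[k] & mask;
        }
    }
}
```

The price of the scan is that every lookup streams the whole table through the cache, which is why the table size matters: an ~8KB table fits in L1d, a ~150KB one does not. Note that C gives no formal constant-time guarantee; compilers may transform such code, so real implementations typically verify the generated assembly.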




Big Integers (aka MPIs) 


“BigNum Math: Implementing Cryptographic Multiple Precision Arithmetic”, by Tom St Denis 
> All the libraries use them (not in hot paths), mbed TLS overuses them 


> linux/lib/mpi/, linux/include/linux/mpi.h

    typedef unsigned long int mpi_limb_t;

    struct gcry_mpi {
        int alloced;        /* array size (# of allocated limbs) */
        int nlimbs;         /* number of valid limbs */
        int nbits;          /* the real number of valid bits (info only) */
        int sign;           /* indicates a negative number */
        unsigned flags;
        mpi_limb_t *d;      /* array with the limbs */
    };


> Need to manage variable-size integers


=> size-specific assembly implementations
Easy assembly


    // a = a + b
    // x[0] is the less significant limb,
    // x[1] is the most significant limb.
    void s_mp_add(unsigned long *a, unsigned long *b) {
        unsigned long carry;

        a[0] += b[0];
        carry = (a[0] < b[0]);
        a[1] += b[1] + carry;
    }


    // Pointer to a is in %RDI, pointer to b is in %RSI
    movq (%rdi), %r8
    movq 8(%rdi), %r9

    addq (%rsi), %r8        // add and set the carry flag
    adcq 8(%rsi), %r9       // use the carry in the next addition

    movq %r8, (%rdi)
    movq %r9, 8(%rdi)
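The two-limb addition above extends to any fixed number of limbs by propagating the carry through a loop. The sketch below is an assumption-labeled illustration in portable C, with a hypothetical function name; a size-specific assembly version would instead unroll the loop into one `addq` followed by `adcq` instructions, as in the two-limb case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* a += b over n limbs, least significant limb first.
 * Returns the carry out of the most significant limb (0 or 1). */
uint64_t mp_add_n(uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t carry = 0;
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        uint64_t t = a[i] + carry;
        uint64_t c1 = t < carry;       /* overflow from adding the carry in */
        a[i] = t + b[i];
        carry = c1 + (a[i] < b[i]);    /* overflow from adding b[i]; never both */
    }
    return carry;
}
```

For a 256-bit value n is 4 on a 64-bit machine. The two overflow checks can never both fire in the same iteration, because the first one only triggers when `t` wraps to zero, so the returned carry is always 0 or 1.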
Open questions and further research


> Ice Lake CPUs have negligible downclocking on AVX-512
https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html


> Parallel Montgomery computations
J.W. Bos, "Montgomery Arithmetic from a Software Perspective", 2017
* SIMD multiplications & squarings of two and more products
* Interleaved Montgomery multiplications
> Better methods for point multiplications


Going to the Linux kernel upstream


> RPC-over-TLS
https://datatracker.ietf.org/doc/draft-ietf-nfsv4-rpc-tls/
> Nginx, HAProxy, Varnish etc. can benefit from the acceleration!
> TLS 1.3
> Client & server side implementations
> Fallback to a user-space TLS library on ClientHello
> Details and the discussion
* https://netdevconf.info/0x14/session.html?talk-performance-study-of-kernel-TLS-handshakes
* https://github.com/tempesta-tech/tempesta/issues/1433
TODO


> More cryptography mathematics performance optimizations
https://github.com/tempesta-tech/tempesta/issues/1064
https://github.com/tempesta-tech/tempesta/issues/1335


> TLS 1.3
https://github.com/tempesta-tech/tempesta/issues/1031


> Moving to the kernel asymmetric keys API
https://github.com/tempesta-tech/tempesta/issues/1332


> The Linux kernel /crypto API performance issues
* SHA-256 (crucial for TLS handshake) is 30-100% slower than in OpenSSL
https://github.com/tempesta-tech/tempesta/issues/1483
* Extra copying and memory allocations in KTLS
https://github.com/tempesta-tech/tempesta/issues/1064
Netdev papers about Tempesta TLS


> "Kernel HTTP/TCP/IP stack for HTTP DDoS mitigation", Netdev 2.1,
https://netdevconf.info/2.1/session.html?krizhanovsky


> "Kernel TLS handshakes for HTTPS DDoS mitigation", Netdev 0x12,
https://netdevconf.info/0x12/session.html?kernel-tls-handhakes-for-https-ddos-mitigation


> "Performance study of kernel TLS handshakes", Netdev 0x14,
https://netdevconf.info/0x14/session.html?talk-performance-study-of-kernel-TLS-handshakes


Thanks! 


Your donations make the TLS handshakes upstream happen earlier!
https://github.com/sponsors/tempesta-tech


https://tempesta-tech.com


ak@tempesta-tech.com


